Reproducible workflows

Version Control and Computational Notebooks

John Little

Duke University Libraries

Center for Data & Visualization Sciences

2024-02-12

Article Production

Reproduction



Authoring and computation environment should enable the articulation of scholarship within a reproducible context

Reproducibility Pyramid.  CC BY John Little & Sophia Lafferty-Hess

Reproducibility Pyramid ● Little & Lafferty-Hess (2020)

Features

  • Support composable recombination
  • Accommodate multimedia expression
  • Provide rich reporting expressions
  • Support economical portability and degrade gracefully
  • Support extensibility
  • Ensure transparency
  • Support a documentary-style project history
  • Accommodate change and collaboration
  • Be citable

Three points

  1. Notebooks (Literate Coding)
  2. Version Control (Git & GitHub)
  3. Sharing (Zenodo, Containers)

Notebooks

Reproducibility

  • Do everything with code!

    • Helps reduce repetition errors
    • Helps avoid copy/paste barriers
    • Orchestrate workflows
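As a minimal sketch of "do everything with code", the example below applies one summary step to every input instead of repeating it by hand. The file names, the column name `value`, and the helper names `summarize` and `run_workflow` are hypothetical and chosen only for illustration; the data is held in memory so the sketch is self-contained.

```python
import csv
import io
from statistics import mean

# Hypothetical data: two small CSV "files" held in memory so the sketch is
# self-contained. In a real project these would be files on disk.
RAW_FILES = {
    "site_a.csv": "sample,value\n1,2.0\n2,4.0\n",
    "site_b.csv": "sample,value\n1,10.0\n2,20.0\n3,30.0\n",
}


def summarize(csv_text):
    """Return the row count and the mean of the 'value' column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row["value"]) for row in rows]
    return {"n": len(values), "mean": mean(values)}


def run_workflow(files):
    """Apply the same summary step to every input, in a fixed order."""
    return {name: summarize(text) for name, text in sorted(files.items())}


if __name__ == "__main__":
    for name, stats in run_workflow(RAW_FILES).items():
        print(name, stats)
```

Adding a third input means adding one entry to `RAW_FILES`; the summary step itself is written once, which is where the reduction in repetition and copy/paste errors comes from.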

Computational Notebooks

  • Authoring environment

    • Code chunks interspersed with natural language
    • aka Literate Coding
  • Easy to read and compose

  • Graceful degradation

Reports and Expressions


Report expressions are rendered at code execution



Interactivity and web applications

  • Shiny
  • Flask
  • WebR
  • Plotly Dash
  • ObservableJS

Quarto Notebook in RStudio

Jupyter Notebooks

Quarto

  • A scientific publishing system
  • R, Python, ObservableJS
  • Compose with standard text editors, or basic IDEs
    • IDEs: RStudio, Jupyter, VSCode

Rendered Outputs

  • Artifacts that document a body of work
  • Are reproducible and modifiable when data or techniques change
  • Easy to update natural language explanations and re-render outputs
  • Schedule emails based on report parameters
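One way to picture parameter-driven outputs is a script that builds one render command per parameter set. The sketch below only constructs the command lines and does not run them. It assumes a hypothetical `report.qmd` that declares `region` and `year` parameters; the `-P name:value` flag is how the Quarto CLI passes parameters to a document.

```python
import shlex


def render_command(document, params, output_format="html"):
    """Build a `quarto render` command line for one set of report parameters."""
    command = ["quarto", "render", document, "--to", output_format]
    for name, value in sorted(params.items()):
        # Each parameter is passed as -P name:value
        command += ["-P", f"{name}:{value}"]
    return command


# Hypothetical parameter values: one report per region.
REGIONS = ["east", "west"]
commands = [
    render_command("report.qmd", {"region": region, "year": 2024})
    for region in REGIONS
]

if __name__ == "__main__":
    for command in commands:
        # Print a shell-safe version; a scheduler could run these instead.
        print(shlex.join(command))
```

Each command could be handed to `subprocess.run` or to a scheduler, so the same document yields a separate rendered report per region.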

Summary of benefits

  • Use natural language to clearly explain data, models, and workflows
  • Reduce dependencies on outside and undocumented steps
  • Expose technical code chunks depending on audience focus
  • State of the art reproducibility
    • 21st century container for evidence-based, computationally-processed research

Version Control

Definition

  • A system to manage projects (repo)
  • A system to track how computer files change over time
  • A system that supports collaborative revision
  • More than file synchronization
  • Assists in project back-ups

Git

  • Free and open source
  • Wildly successful; most broadly implemented
  • In use across the globe
  • Use it on any file system
  • Track any file
  • Use it in any environment
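One reason Git can track any file is that it treats file content as plain bytes and names each version by a hash of those bytes. The sketch below reproduces the identifier Git assigns to a file's content (a "blob"), assuming a repository that uses Git's default SHA-1 object format. The function name `git_blob_id` is hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git gives to file content (SHA-1 object format).

    Git hashes a short header, "blob <size in bytes>" followed by a NUL byte,
    and then the raw content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    draft_1 = b"Methods: we sampled 10 sites.\n"
    draft_2 = b"Methods: we sampled 12 sites.\n"
    # Any change to the bytes produces a different identifier.
    print(git_blob_id(draft_1))
    print(git_blob_id(draft_2))
```

For the same bytes, the result should match what `git hash-object <file>` prints, which is how Git can tell whether a file changed between commits.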

Scalable to project size

Project Repositories

Archival vs version-control



  Zenodo - posterity of milestones

Git - track evolution of workflow (i.e. transparency)

Track change


Branches

GitHub

  • Profile (store and host) git repos
  • Enable collaboration, public across the globe or private
  • Editorial and fine-grain control

Git + GitHub

Hubs

  • GitHub
  • GitLab
  • Bitbucket

Duke specific hubs

  • gitlab.oit.duke.edu (NetID)
  • PACE
  • Anywhere that data and coding happens.

File Distribution and Collaboration

Other project management features


Basic features

Git features implemented for distribution

  • Push
  • Public or Private
  • Clone / Fork
  • Pull Request
  • Pull

Push

Clone

Fork / PR

Summary

  • Git is used to track changes to your repo
  • GitHub is used to distribute your git repo and facilitate collaboration

Containers

Sharing your workspace


Your computation workspace (i.e. your laptop, desktop, cloud)

Give someone else your laptop so they can play around with your projects

  • the code, the data, the settings and configurations?
  • Good idea?

Now you can share a copy of your computational environment

How

  • Binder: package and share reproducible computational environments
    • mybinder.org (public BinderHub portal)
  • Zenodo: general, open repository to deposit research papers, data sets, code, reports and related artifacts and connect to a citable DOI.
  • Combine GitHub releases with Zenodo to archive your milestones and share the interactive computation in a BinderHub

Binder Hub

  • Easiest: mybinder.org open and public
    • quarto use binder
  • Security demands may push you to use Singularity

Steps

  1. Make a GitHub Release at project milestone(s)
  2. Connect GitHub to Zenodo
    1. Mint a DOI for a GitHub Release (persistent identifier: citation; milestones)
    2. With DOI, link to ORCID
  3. Create a publicly launchable, fully functional computation container of your work
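The last step can be sketched as building the links a reader would click. The example assumes the public mybinder.org service and its `v2/gh/<owner>/<repo>/<ref>` launch-URL pattern. The helper names `binder_url` and `doi_url` are hypothetical, the DOI shown is a placeholder and not a real record, and whether a given repository actually launches depends on it containing a Binder-compatible environment definition.

```python
from urllib.parse import quote


def binder_url(owner, repo, ref="HEAD", urlpath=None):
    """Build a mybinder.org launch URL for a GitHub repository.

    ref can be a branch, tag, or commit; a release tag pins the milestone.
    urlpath optionally selects the interface to open (for example "rstudio",
    assuming the image provides it).
    """
    url = (
        "https://mybinder.org/v2/gh/"
        f"{quote(owner)}/{quote(repo)}/{quote(ref, safe='')}"
    )
    if urlpath:
        url += f"?urlpath={quote(urlpath, safe='')}"
    return url


def doi_url(doi):
    """Turn a DOI into its resolvable https://doi.org link."""
    return f"https://doi.org/{doi}"


if __name__ == "__main__":
    # Repository taken from the Examples list; the ref is the default HEAD.
    print(binder_url("libjohn", "workshop_rfun_iterate"))
    # Placeholder DOI for illustration only; Zenodo DOIs use the 10.5281 prefix.
    print(doi_url("10.5281/zenodo.0000000"))
```

Pointing `ref` at a release tag rather than `HEAD` ties the launchable container to the same milestone that the DOI cites.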

Examples

  • https://github.com/libjohn/workshop_rfun_iterate?tab=readme-ov-file#readme
  • https://github.com/libjohn/workshop_webscraping?tab=readme-ov-file#readme